ClusterDE: a post-clustering differential expression method
July 5, 2024
What is differential expression?
What are the problems with existing methods?
How does ClusterDE solve these problems?
Does ClusterDE really solve these problems?
What are other variations on ClusterDE that we might consider?
Cell-by-gene matrix:
\[ \begin{bmatrix} Y_{11} & \dots & Y_{1m} \\ \vdots & \ddots & \\ Y_{n1} & & Y_{nm} \end{bmatrix} \]
Ultimate goal: identify cell types.
\[ \begin{bmatrix} Y_{11} & \dots & Y_{1m} & Z_1\\ \vdots & \ddots & & \vdots \\ Y_{n1} & & Y_{nm} & Z_n \end{bmatrix} \]
Assumption: only 2 cell types!
\[ Z_i \in \{0, 1\} \]
Use cell-type marker genes to identify cell types.
a potential cell-type marker
a potential set of cell-type markers
top DE genes identified by hypothesis tests
Figure S16
However, we used the data twice.
We clustered the cells.
We tested each gene to see how different its expression levels are between the two clusters. However, we already know that this variation is there: the clustering step created it.
In other words: we forced the genes to exhibit different distributions in the two clusters.
When our clusters are bad, we open ourselves up to false discoveries.
We test in order to discover differentially expressed genes.
False discoveries are genes we declare DE that are truly similar between cell types.
\[ \text{FDR} := E\left[\frac{V}{R \lor 1}\right] \]
\[ \text{FDR} := E\left[\frac{\text{number of false rejections}}{\text{total number of rejections} \lor 1}\right] \]
Problem: reliance on clustering quality inflates the FDR.
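The inflation is easy to reproduce on pure noise: cluster cells with no real structure, then test genes between the resulting clusters. A minimal sketch (not part of any package; a PCA sign split stands in for a full clustering algorithm):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Pure noise: 500 "cells" x 100 "genes", one true population, zero DE genes.
n_cells, n_genes = 500, 100
Y = rng.standard_normal((n_cells, n_genes))

# "Cluster" the cells by the sign of their first principal component score
# (a stand-in for k-means or graph clustering).
Y_centered = Y - Y.mean(axis=0)
_, _, vt = np.linalg.svd(Y_centered, full_matrices=False)
labels = (Y_centered @ vt[0]) > 0

# Test every gene between the two data-derived clusters.
pvals = np.array([
    stats.ttest_ind(Y[labels, j], Y[~labels, j]).pvalue
    for j in range(n_genes)
])

# A valid level-0.05 test would flag about 5 of 100 genes;
# double-dipping flags far more, even though no gene is truly DE.
n_sig = int((pvals < 0.05).sum())
print(f"{n_sig} of {n_genes} genes 'significant' at 0.05")
```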
A graphical illustration of ClusterDE.
Intuition: compare the real data against synthetic data generated from a concrete null hypothesis.
Differences from this synthetic data imply differentially expressed genes.
Assumption: each gene follows a negative binomial distribution.
\[ Y_j \sim NB(\mu_j, \sigma_j^2) \]
But our data doesn’t look like this:
Null generation steps.
Theorem (Probability Integral Transform): \(F_X(X) \sim \text{Uniform}(0, 1)\).
Proof (the standard PIT argument; let \(Y := F_X(X)\), and assume \(F_X\) is continuous and strictly increasing so that \(F_X^{-1}\) exists):
\[ \begin{align} F_Y(y) &= P(Y \leq y)\\ &= P(F_X(X) \leq y) && \text{(substituted the definition of } Y)\\ &= P(X \leq F_X^{-1}(y)) && \text{(applied } F_X^{-1} \text{ to both sides)}\\ &= F_X(F_X^{-1}(y)) && \text{(the definition of a CDF)}\\ &= y \end{align} \]
We can transform a random variable into a different random variable using their CDFs.
\[U = F_X(X) \sim \text{Uniform}(0, 1)\]
\[X = F_X^{-1}(U)\]
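A quick numerical check of both directions (a sketch; the exponential distribution is just an illustrative choice): pushing samples through their own CDF gives uniforms, and pushing uniforms through the inverse CDF recovers the original distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Forward: X ~ Exponential(1), so U = F_X(X) should be Uniform(0, 1).
x = rng.exponential(scale=1.0, size=10_000)
u = stats.expon.cdf(x)

# Inverse: starting from fresh uniforms, F_X^{-1}(U) is Exponential(1) again.
u_new = rng.uniform(size=10_000)
x_new = stats.expon.ppf(u_new)

print(u.mean(), x_new.mean())  # near 0.5 and 1.0 respectively
```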
Theorem (Sklar’s Theorem): Let \(\mathbf{X}\) be an \(m\)-dimensional random vector with joint cumulative distribution function \(F\) and marginal distribution functions \(F_j, j = 1, ..., m\). The joint CDF can be expressed as
\[ F(x_1, ..., x_m) = C(F_1(x_1), ..., F_m(x_m)) \]
with associated probability density (or mass) function
\[ f(x_1, ..., x_m) = c(F_1(x_1), ..., F_m(x_m)) f_1(x_1) ... f_m(x_m) \]
for an \(m\)-dimensional copula \(C\) with copula density \(c\).
The inverse also holds: the copula corresponding to a multivariate CDF \(F\) with marginal distribution functions \(F_j, j = 1, ..., m\) can be expressed as
\[ C(u_1, ..., u_m) = F(F_1^{-1}(u_1), ..., F_m^{-1}(u_m)), \]
and the copula density (or mass) function is
\[ c(u_1,...,u_m) = \frac{f(F_1^{-1}(u_1), ..., F_m^{-1}(u_m))} {f_1(F_1^{-1}(u_1)) ... f_m(F_m^{-1}(u_m))}. \]
Sklar’s theorem allows us to separately model the individual variables and their dependence.
For convenience, we will choose the Gaussian copula.
\[ C(\mathbf{u}; \mathbf{R}) = \Phi_{\mathbf{R}}(\Phi^{-1}(u_1), ..., \Phi^{-1}(u_m)) \]
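A minimal sketch of what the Gaussian copula buys us (dimensions and the correlation value are illustrative): correlated normals pushed through \(\Phi\) give dependent uniforms, which can then feed any marginal's quantile function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Sample from a 2-dimensional Gaussian copula with correlation 0.8.
R = np.array([[1.0, 0.8],
              [0.8, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=R, size=20_000)
u = stats.norm.cdf(z)  # each column is Uniform(0, 1), but the columns are dependent

print(u.mean(axis=0))          # both near 0.5: uniform marginals
print(np.corrcoef(u.T)[0, 1])  # clearly positive: dependence survives the transform
```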
Now, our goal is to estimate:
the marginal parameters \(\{\mu_j, \sigma_j\}_{j = 1}^m\)
the copula correlation matrix \(\mathbf{R}\)
For each gene \(j\), estimate \(\{\mu_j, \sigma_j\}\) using maximum likelihood. These are the marginal distributions.
\[ \begin{bmatrix} \tilde{X}_{11} & \dots & \tilde{X}_{1m} \\ \vdots & \ddots & \\ \tilde{X}_{n1} & & \tilde{X}_{nm} \end{bmatrix} \]
\[ \begin{bmatrix} \hat{F}_1^{-1}(\Phi(\tilde{X}_{11})) & \dots & \hat{F}_m^{-1}(\Phi(\tilde{X}_{1m})) \\ \vdots & \ddots & \\ \hat{F}_1^{-1}(\Phi(\tilde{X}_{n1})) & & \hat{F}_m^{-1}(\Phi(\tilde{X}_{nm})) \end{bmatrix} = \begin{bmatrix} \tilde{Y}_{11} & \dots & \tilde{Y}_{1m} \\ \vdots & \ddots & \\ \tilde{Y}_{n1} & & \tilde{Y}_{nm} \end{bmatrix} \]
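The steps above can be sketched end to end. This is a simplified stand-in, not the ClusterDE implementation: it uses method-of-moments NB fits instead of MLE, and the plain CDF instead of the randomized distributional transform that discrete data properly requires.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Toy "real" data: counts for 2000 cells x 5 genes, NB with mean 5, variance 10.
n_cells, n_genes = 2000, 5
mu, var = 5.0, 10.0
p = mu / var              # scipy's nbinom success probability
r = mu ** 2 / (var - mu)  # scipy's nbinom size parameter
Y = rng.negative_binomial(r, p, size=(n_cells, n_genes))

# 1. Fit the NB marginals (method of moments here; MLE in practice).
mu_hat = Y.mean(axis=0)
var_hat = Y.var(axis=0)
p_hat = mu_hat / var_hat
r_hat = mu_hat ** 2 / (var_hat - mu_hat)

# 2. Map counts to the Gaussian scale via F_j then Phi^{-1}, giving X-tilde.
eps = 1e-6
U = np.clip(stats.nbinom.cdf(Y, r_hat, p_hat), eps, 1 - eps)
X_tilde = stats.norm.ppf(U)

# 3. Estimate the copula correlation matrix R from X-tilde.
R_hat = np.corrcoef(X_tilde, rowvar=False)

# 4. Sample new Gaussians with correlation R-hat, then invert back to counts.
Z = rng.multivariate_normal(np.zeros(n_genes), R_hat, size=n_cells)
Y_null = stats.nbinom.ppf(np.clip(stats.norm.cdf(Z), eps, 1 - eps), r_hat, p_hat)

# The synthetic null preserves the marginals and the gene-gene dependence but,
# by construction, contains a single homogeneous population.
print(Y.mean(axis=0), Y_null.mean(axis=0))
```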
Null generation steps.
Clustering step
Hypothesis testing step
Given the target and null DE scores, compute a contrast score for gene \(j\) as \(C_j := S_j - \tilde{S}_j\).
FDR control.
\[ t^* = \min\left\{t \in \{|C_j|: C_j \neq 0\}: \frac{1+\#\{j:C_j \leq -t\}}{\#\{j:C_j \geq t\} \lor 1} \leq q\right\} \]
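The Clipper-style cutoff can be written directly from this formula. A sketch with illustrative contrast scores (the variable names are ours, not ClusterDE's):

```python
import numpy as np

def clipper_threshold(contrast, q=0.1):
    """Smallest t among {|C_j| : C_j != 0} such that
    (1 + #{C_j <= -t}) / max(#{C_j >= t}, 1) <= q; inf if none qualifies."""
    candidates = np.unique(np.abs(contrast[contrast != 0]))
    for t in candidates:  # ascending, so the first hit is the minimum
        fdr_hat = (1 + (contrast <= -t).sum()) / max((contrast >= t).sum(), 1)
        if fdr_hat <= q:
            return t
    return np.inf

# Illustrative contrast scores C_j = S_j - S~_j:
# 10 genes with strong target-only signal, 20 null-like genes near zero.
contrast = np.concatenate([np.full(10, 5.0),
                           np.tile([0.5, -0.5], 10)])
t_star = clipper_threshold(contrast, q=0.1)
discoveries = np.flatnonzero(contrast >= t_star)
print(t_star, len(discoveries))  # 5.0, 10
```

At t = 0.5 the estimated FDR is (1+10)/20 = 0.55, so the threshold settles at t = 5.0, where it is 1/10 = 0.1.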
Real data
homogeneous cell line (e.g. H2228): no DE genes
different cell types: biologically meaningful DE genes
Choose 2 from this large set
Choose 2 that we know are similar
some top DE genes have similar expression levels
identified more biologically meaningful DE genes
Knockoffs
Change setting to prediction of cell cluster label
Generate features that are uncorrelated with the response
Permutation:
Power and FDR comparisons
We want to identify cell types. DE identifies candidates for cell-type marker genes.
Traditional DE discovery methods double-dip, increasing the risk of false discovery.
ClusterDE generates a synthetic null dataset, runs the same pipeline on it in parallel, and combines the two sets of results in a meaningful way using Clipper.
ClusterDE performs well in simulations and on real data, both when there are and when there aren't distinct cell types.
The copula strategy for null generation has better power and FDR control.
Comparison of synthetic null generation strategies.
Discovering DE genes helps us generate cell-type markers, which identify the cell types in our sample.
Naive discovery methods (e.g. Seurat) double-dip: they cluster the data, then perform hypothesis tests using those clusters. This opens up an increased risk of false discoveries.
ClusterDE outperforms other methods when
- there is only one cell type, by identifying few DE genes
- there are distinct cell types, by identifying more meaningful DE genes
The copula generation strategy outperforms other alternatives in terms of FDR and power.
H2228: “the gold standard for benchmarking the accuracy of clustering.”
One true cluster
Successfully identified no DE genes
Separately, use the Gaussian copula to model the dependence structure.
We only consider two cell types, so \(Z \in \{0, 1\}\)
We want to test
\[H_{0j} : \mu_{Z = 0, j} = \mu_{Z = 1, j}\]
But we can only test \[H_{0j}^{DD} : \mu_{\hat{Z} = 0, j} = \mu_{\hat{Z} = 1, j}\]
False discoveries occur when the double-dipping null \[H_{0j}^{DD} : \mu_{\hat{Z} = 0, j} = \mu_{\hat{Z} = 1, j}\] does NOT hold, but the true null \[H_{0j} : \mu_{Z = 0, j} = \mu_{Z = 1, j}\] does.
In words: we made the right decision, but we set up the wrong test.